Levene’s Test (Homogeneity of Variances)#
Levene’s test is a hypothesis test for equal variances across 2+ independent groups. It’s commonly used as an assumption check before methods that assume homoscedasticity (equal within-group variance), such as classical one-way ANOVA.
Learning goals#
By the end you should be able to:
state the null/alternative hypotheses and what rejecting means
explain the key trick: convert a variance problem into a mean problem
compute the Levene statistic from scratch using NumPy
interpret the output (test statistic, degrees of freedom, p-value)
use Plotly visuals to build intuition (raw samples vs deviation variables)
What it’s used for#
Levene’s test answers:
“Do these groups have the same spread/variability?”
Typical use-cases:
ANOVA / linear models: check whether residual variability looks similar across groups.
A/B/n tests: check if one variant causes outcomes to be more variable.
Quality control: check if batches/machines differ in variability.
Finance/ops: detect regime changes where volatility differs across segments.
Important:
This is a test about variances, not means. Groups can have different means and still have equal variances.
If you reject, Levene’s test does not tell you which groups differ; it only says “at least one variance differs”.
For two-sample mean comparisons, it’s often better to use Welch’s t-test directly rather than “pre-testing” variances.
Hypotheses and interpretation#
For \(k\) groups with population variances \(\sigma_1^2, \dots, \sigma_k^2\):
Null hypothesis: \(H_0: \sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2\) (all variances equal)
Alternative: \(H_1\): at least one variance differs
Decision rule at significance level \(\alpha\) (e.g. 0.05):
if p-value < α: reject \(H_0\) → evidence that the variances are not all equal
if p-value ≥ α: fail to reject \(H_0\) → not enough evidence of a variance difference
A p-value is not the probability that \(H_0\) is true. It is the probability (under \(H_0\)) of getting a test statistic at least as extreme as the one observed.
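The decision rule is mechanical once you have a p-value. A minimal sketch (the p-value here is made up for illustration):

```python
alpha = 0.05     # significance level chosen before looking at the data
p_value = 0.012  # hypothetical p-value from a Levene run

# Reject H0 (equal variances) only when the p-value falls below alpha.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(decision)  # reject H0
```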
The core idea (why Levene works)#
Levene’s test turns “equal variances?” into “equal means?” by constructing a new variable:
For each group \(i\), pick a center (one of):
mean (original Levene)
median (Brown–Forsythe variant; more robust)
trimmed mean (robust)
Compute absolute deviations from the group center:
\[ z_{ij} = |x_{ij} - c_i| \]
If groups truly have the same variance, then the typical size of these deviations should be similar across groups.
So we run a one-way ANOVA on \(z_{ij}\):
\[ W = \frac{N-k}{k-1} \cdot \frac{\sum_{i=1}^{k} n_i (\bar z_i - \bar z)^2}{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (z_{ij} - \bar z_i)^2} \]
where \(n_i\) is the group size, \(N=\sum_i n_i\), \(\bar z_i\) are the group means of the deviations, and \(\bar z\) is the grand mean of all deviations.
Under \(H_0\), \(W\) is approximately \(F\)-distributed with df1 = k-1 and df2 = N-k.
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 42
rng = np.random.default_rng(SEED)
print("numpy:", np.__version__)
print("pandas:", pd.__version__)
print("plotly:", __import__("plotly").__version__)
numpy: 1.26.2
pandas: 2.1.3
plotly: 6.5.2
From-scratch NumPy implementation#
Below is a low-level implementation of the Levene statistic. The goal is to make the mechanics explicit:
compute group centers \(c_i\)
convert samples to deviations \(z_{ij} = |x_{ij}-c_i|\)
compute the one-way ANOVA \(F\) statistic on the deviations
For the p-value, we show two options:
a NumPy-only Monte Carlo approximation using the \(F\) distribution definition as a ratio of chi-squares
(optional) a SciPy reference value (`scipy.stats.f.sf`) to match what you’d do in practice
def trimmed_mean(x: np.ndarray, proportiontocut: float) -> float:
    x = np.asarray(x, dtype=float).ravel()
    if not (0 <= proportiontocut < 0.5):
        raise ValueError("proportiontocut must be in [0, 0.5).")
    if x.size == 0:
        raise ValueError("Empty sample.")
    k = int(np.floor(proportiontocut * x.size))
    if 2 * k >= x.size:
        raise ValueError("Trim proportion too large for this sample size.")
    x_sorted = np.sort(x)
    return float(x_sorted[k : x.size - k].mean())
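To see what trimming buys you, here is the same sort-slice-average step applied by hand to a small sample with one outlier (values chosen for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one wild outlier
k = int(np.floor(0.2 * x.size))            # trim 20% from each tail -> k = 1
core = np.sort(x)[k : x.size - k]          # drop the k smallest and k largest values

print(core.mean())  # 3.0, versus a plain mean of 22.0
```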
def _center(x: np.ndarray, center: str, proportiontocut: float) -> float:
    center = center.lower()
    if center == "mean":
        return float(np.mean(x))
    if center == "median":
        return float(np.median(x))
    if center in {"trimmed", "trimmed_mean"}:
        return trimmed_mean(x, proportiontocut)
    raise ValueError("center must be one of: 'mean', 'median', 'trimmed'.")
def levene_statistic(*groups: np.ndarray, center: str = "median", proportiontocut: float = 0.05):
    groups = [np.asarray(g, dtype=float).ravel() for g in groups]
    if len(groups) < 2:
        raise ValueError("Levene's test needs at least 2 groups.")
    if any(g.size < 2 for g in groups):
        raise ValueError("Each group must contain at least 2 observations.")
    if not all(np.isfinite(g).all() for g in groups):
        raise ValueError("All observations must be finite (no NaN/inf).")
    k = len(groups)
    n = np.array([g.size for g in groups], dtype=int)
    N = int(n.sum())
    df1 = k - 1
    df2 = N - k
    centers = np.array([
        _center(g, center=center, proportiontocut=proportiontocut) for g in groups
    ])
    z_groups = [np.abs(g - c) for g, c in zip(groups, centers)]
    z_means = np.array([z.mean() for z in z_groups])
    z_bar = float(np.sum(n * z_means) / N)
    ss_between = float(np.sum(n * (z_means - z_bar) ** 2))
    ss_within = float(np.sum([np.sum((z - m) ** 2) for z, m in zip(z_groups, z_means)]))
    if ss_within == 0.0:
        W = np.inf if ss_between > 0 else 0.0
    else:
        W = (ss_between / df1) / (ss_within / df2)
    details = {
        "centers": centers,
        "z_groups": z_groups,
        "z_means": z_means,
        "z_bar": z_bar,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "df1": df1,
        "df2": df2,
    }
    return W, details
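As a hand-checkable sanity check on the statistic, the same arithmetic can be done inline for two tiny groups (the data below are illustrative, not from the examples that follow):

```python
import numpy as np

# Two tiny groups with medians 2.5 and 5.0.
g1 = np.array([1.0, 2.0, 3.0, 4.0])
g2 = np.array([2.0, 4.0, 6.0, 8.0])

# Absolute deviations from each group's median.
z1 = np.abs(g1 - np.median(g1))    # [1.5, 0.5, 0.5, 1.5] -> mean 1.0
z2 = np.abs(g2 - np.median(g2))    # [3.0, 1.0, 1.0, 3.0] -> mean 2.0

k, N = 2, g1.size + g2.size        # 2 groups, 8 observations
z_bar = (z1.sum() + z2.sum()) / N  # grand mean of deviations = 1.5

ss_between = g1.size * (z1.mean() - z_bar) ** 2 + g2.size * (z2.mean() - z_bar) ** 2  # 2.0
ss_within = ((z1 - z1.mean()) ** 2).sum() + ((z2 - z2.mean()) ** 2).sum()             # 5.0

W = (ss_between / (k - 1)) / (ss_within / (N - k))
print(W)  # 2.4
```

The same value should come out of `levene_statistic(g1, g2, center="median")`.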
def f_pvalue_mc(W: float, df1: int, df2: int, n_mc: int = 200_000, seed: int = 0) -> float:
    """Right-tail p-value via NumPy-only Monte Carlo for an F(df1, df2) statistic."""
    if not np.isfinite(W):
        return 0.0
    if n_mc <= 0:
        raise ValueError("n_mc must be positive.")
    rng_local = np.random.default_rng(seed)
    f_sim = (rng_local.chisquare(df1, size=n_mc) / df1) / (rng_local.chisquare(df2, size=n_mc) / df2)
    return float(np.mean(f_sim >= W))
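Two quick sanity checks on the Monte Carlo construction (inlined here so the snippet stands alone): the right tail at 0 must be exactly 1, and for large df2 the tail at 1 should approach \(P(\chi^2_2 \ge 2) = e^{-1} \approx 0.368\).

```python
import numpy as np

def mc_tail(W, df1, df2, n_mc=200_000, seed=0):
    # Same construction as f_pvalue_mc: an F(df1, df2) variate is a ratio of scaled chi-squares.
    rng = np.random.default_rng(seed)
    f_sim = (rng.chisquare(df1, size=n_mc) / df1) / (rng.chisquare(df2, size=n_mc) / df2)
    return float(np.mean(f_sim >= W))

print(mc_tail(0.0, 2, 477))   # 1.0: every simulated F value is >= 0
print(mc_tail(1.0, 2, 2000))  # close to exp(-1) ~= 0.368
```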
def levene_test_numpy(
    *groups: np.ndarray,
    center: str = "median",
    proportiontocut: float = 0.05,
    n_mc: int = 200_000,
    seed: int = 0,
):
    W, details = levene_statistic(*groups, center=center, proportiontocut=proportiontocut)
    p_mc = f_pvalue_mc(W, details["df1"], details["df2"], n_mc=n_mc, seed=seed)
    return W, p_mc, details
def plot_raw_vs_deviation(groups, group_names, centers, center_label: str):
    fig = make_subplots(
        rows=1,
        cols=2,
        subplot_titles=(
            "Raw samples",
            f"Absolute deviations |x - {center_label}(group)|",
        ),
        horizontal_spacing=0.15,
    )
    for name, x, c in zip(group_names, groups, centers):
        fig.add_trace(
            go.Violin(
                y=x,
                name=name,
                box_visible=True,
                meanline_visible=True,
                points="outliers",
            ),
            row=1,
            col=1,
        )
        fig.add_trace(
            go.Violin(
                y=np.abs(x - c),
                name=name,
                box_visible=True,
                meanline_visible=True,
                points="outliers",
                showlegend=False,
            ),
            row=1,
            col=2,
        )
    fig.update_layout(violinmode="group", height=420)
    fig.update_yaxes(title_text="value", row=1, col=1)
    fig.update_yaxes(title_text=f"|x - {center_label}(group)|", row=1, col=2)
    return fig
def plot_mean_abs_deviation(z_groups, group_names):
    mad = np.array([float(np.mean(z)) for z in z_groups])
    fig = go.Figure(go.Bar(x=group_names, y=mad, text=np.round(mad, 4)))
    fig.add_hline(
        y=float(mad.mean()),
        line_dash="dash",
        annotation_text="grand mean",
        annotation_position="top left",
    )
    fig.update_traces(textposition="outside")
    fig.update_layout(
        title="Group means of |x - center(group)| (what the ANOVA compares)",
        xaxis_title="group",
        yaxis_title="mean absolute deviation",
        height=360,
    )
    return fig
def plot_f_null_distribution(W: float, df1: int, df2: int, n: int = 150_000, seed: int = 0):
    rng_local = np.random.default_rng(seed)
    f_sim = (rng_local.chisquare(df1, size=n) / df1) / (rng_local.chisquare(df2, size=n) / df2)
    fig = go.Figure(go.Histogram(x=f_sim, nbinsx=80, histnorm="probability density"))
    fig.add_vline(
        x=W,
        line_color="crimson",
        line_width=3,
        annotation_text="observed W",
        annotation_position="top right",
    )
    fig.update_layout(
        title=f"Null reference: simulated F({df1}, {df2}) density + observed W",
        xaxis_title="F value",
        yaxis_title="density",
        height=360,
        showlegend=False,
    )
    return fig
Example 1: different means, similar variances#
Here the groups have noticeably different means, but we generate them with the same standard deviation. Levene’s test should typically fail to reject \(H_0\) (equal variances).
n = 160
g1 = rng.normal(loc=0.0, scale=1.0, size=n)
g2 = rng.normal(loc=2.0, scale=1.0, size=n)
g3 = rng.normal(loc=-1.0, scale=1.0, size=n)
groups = [g1, g2, g3]
names = ["A", "B", "C"]
W, p_mc, details = levene_test_numpy(*groups, center="median", n_mc=200_000, seed=1)
print(f"NumPy Levene (median center): W={W:.4f}, df=({details['df1']}, {details['df2']}), p_mc≈{p_mc:.4f}")
# Practical reference (analytic p-value)
from scipy.stats import f as f_dist, levene as levene_scipy
p_f = float(f_dist.sf(W, details["df1"], details["df2"]))
W_scipy, p_scipy = levene_scipy(*groups, center="median")
print(f"SciPy reference: W={W_scipy:.4f}, p={p_scipy:.4f}")
print(f"F survival (SciPy f.sf): p={p_f:.4f}")
fig = plot_raw_vs_deviation(groups, names, details["centers"], center_label="median")
fig.update_layout(title="Example 1: raw samples vs absolute deviations")
fig.show()
fig = plot_mean_abs_deviation(details["z_groups"], names)
fig.show()
NumPy Levene (median center): W=1.6790, df=(2, 477), p_mc≈0.1864
SciPy reference: W=1.6790, p=0.1877
F survival (SciPy f.sf): p=0.1877
Example 2: different variances#
Now the groups have similar means but different spreads. Levene’s test should typically reject \(H_0\).
n = 160
g1 = rng.normal(loc=0.0, scale=1.0, size=n)
g2 = rng.normal(loc=0.0, scale=2.0, size=n)
g3 = rng.normal(loc=0.0, scale=0.5, size=n)
groups = [g1, g2, g3]
names = ["sigma=1.0", "sigma=2.0", "sigma=0.5"]
W, p_mc, details = levene_test_numpy(*groups, center="median", n_mc=200_000, seed=2)
p_f = float(f_dist.sf(W, details["df1"], details["df2"]))
print(f"NumPy Levene (median center): W={W:.4f}, df=({details['df1']}, {details['df2']}), p_mc≈{p_mc:.6f}")
print(f"Analytic p (SciPy f.sf): p={p_f:.6f}")
fig = plot_raw_vs_deviation(groups, names, details["centers"], center_label="median")
fig.update_layout(title="Example 2: raw samples vs absolute deviations")
fig.show()
fig = plot_mean_abs_deviation(details["z_groups"], names)
fig.show()
# A visual way to read the p-value: where does W fall under the null F distribution?
fig = plot_f_null_distribution(W, details["df1"], details["df2"], n=150_000, seed=3)
fig.show()
NumPy Levene (median center): W=95.0895, df=(2, 477), p_mc≈0.000000
Analytic p (SciPy f.sf): p=0.000000
Why people often use the median center (robustness)#
The original Levene test uses the mean as the group center, but a common robust variant uses the median (often called the Brown–Forsythe test).
Intuition:
The mean is more sensitive to outliers and heavy tails.
Using the median tends to control false positives better when data are non-normal.
Below we simulate under \(H_0\) (equal variances) and compare how often each variant rejects at \(\alpha=0.05\).
from scipy.stats import f as f_dist
def pvalue_f(W: float, df1: int, df2: int) -> float:
    return float(f_dist.sf(W, df1, df2))

def estimate_type1_error(
    *,
    gen_groups,
    center: str,
    n_reps: int = 600,
    alpha: float = 0.05,
    seed: int = 123,
):
    rng_local = np.random.default_rng(seed)
    rejects = 0
    for _ in range(n_reps):
        groups = gen_groups(rng_local)
        W, details = levene_statistic(*groups, center=center)
        p = pvalue_f(W, details["df1"], details["df2"])
        rejects += int(p < alpha)
    return rejects / n_reps
k = 3
n = 30
def gen_normal(rng_local):
    return [rng_local.normal(loc=0.0, scale=1.0, size=n) for _ in range(k)]

def gen_t3_heavy_tails(rng_local):
    # Standard t(df=3) has variance df/(df-2) = 3, so scale by 1/sqrt(3) to make the variance ~1.
    return [rng_local.standard_t(df=3, size=n) / np.sqrt(3.0) for _ in range(k)]
rates = []
for dist_name, gen in [("Normal", gen_normal), ("t(df=3) heavy tails", gen_t3_heavy_tails)]:
    for center in ["mean", "median"]:
        rate = estimate_type1_error(gen_groups=gen, center=center, n_reps=700, alpha=0.05, seed=7)
        rates.append({"distribution": dist_name, "center": center, "rejection_rate": rate})
df_rates = pd.DataFrame(rates)
df_rates
|   | distribution | center | rejection_rate |
|---|---|---|---|
| 0 | Normal | mean | 0.052857 |
| 1 | Normal | median | 0.040000 |
| 2 | t(df=3) heavy tails | mean | 0.067143 |
| 3 | t(df=3) heavy tails | median | 0.041429 |
fig = px.bar(
    df_rates,
    x="distribution",
    y="rejection_rate",
    color="center",
    barmode="group",
    text="rejection_rate",
    title="Type I error under H0 (how often we reject when variances are equal)",
)
fig.add_hline(
    y=0.05,
    line_dash="dash",
    annotation_text="alpha = 0.05",
    annotation_position="top left",
)
fig.update_traces(texttemplate="%{text:.3f}", textposition="outside")
fig.update_layout(yaxis_title="rejection rate", height=380)
fig.show()
Practical checklist + pitfalls#
Independence matters: Levene assumes observations are independent within and across groups.
Scale of measurement: works best for continuous/interval data; with discrete/ordinal outcomes interpret carefully.
What rejection means: at least one group variance differs → consider Welch’s ANOVA, robust methods, variance-stabilizing transforms, or modeling heteroscedasticity directly.
What non-rejection means: not enough evidence of variance differences ≠ proof of equal variances (power can be low for small samples).
Multiple comparisons: Levene is an omnibus test; if you follow with pairwise variance checks, correct for multiple testing.
Alternatives:
Bartlett’s test: more powerful under normality, but very sensitive to non-normal data.
Fligner–Killeen: nonparametric/robust test for equal variances.
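For comparison, SciPy ships all three tests with the same call shape. A quick sketch on two synthetic samples with clearly different spreads (the data here are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.0, scale=2.0, size=100)

# Each test returns (statistic, pvalue); a small p-value suggests unequal variances.
for name, res in [
    ("Levene (median)", stats.levene(a, b, center="median")),
    ("Bartlett", stats.bartlett(a, b)),
    ("Fligner-Killeen", stats.fligner(a, b)),
]:
    print(f"{name}: stat={res.statistic:.3f}, p={res.pvalue:.2e}")
```

With a true standard-deviation ratio of 2 and n = 100 per group, all three tests should reject comfortably; Bartlett is typically the most powerful here but the least robust off normality.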
Exercises#
Modify Example 2 so only one group has a larger variance. How does \(W\) change?
Increase/decrease sample sizes and see how the p-value behaves.
Implement a simple “follow-up” routine: plot sample variances and bootstrap confidence intervals per group.
Compare Levene (median center) to `scipy.stats.fligner` on heavy-tailed data.
References#
Levene, H. (1960). Robust tests for equality of variances.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for the equality of variances.
SciPy documentation: `scipy.stats.levene`